Reducing Power and Area in Cell-Based Design

By Paul de Dood
Posted  04/27/01, 05:16:19 PM EDT

Power consumption is a major problem for emerging complex designs, particularly those with tight power budgets. The problem will grow as process technologies shrink into the ultra-deep-submicron range, bringing higher transistor counts and current densities.

In a typical synthesis and place-and-route (SPR) flow, a static set of library cells is used to map a given design into its final physical implementation. The number of cells, and with it the amount of optimization that can be done, is limited. This approach is considered less efficient than arbitrary full-custom design. In the past, the limitations of static libraries were necessary because SPR tools couldn't efficiently automate design at the transistor level. As a result, cell generation hasn't been included in the EDA design flow and has been largely done by hand.

Furthermore, with the advent of third-party library companies, library creation is often outsourced.

However, bringing library creation back into the design flow will remove the restrictions of a pre-determined set of library elements and provide several advantages: performance improvement of 10 to 15 percent, time-to-market reduction due to reduced design iterations, area reductions of up to 25 percent, and power reduction of 25 percent or more. The approach works with any circuit or logic style.

While standard-cell libraries generally have a wide variety of logical functions, the primary issue with static libraries is that there is a limited number of discrete transistor sizes for any given logical function. For a typical 300-cell library, each logical function (for example, a 4-input OR gate) will have from 1 to 10 electrical variants. However, there are millions of possible variations of transistor sizes, producing radically different timing behavior. For example, for a given logic function, the drive-strength of the cell can be varied, as can the beta-ratio (the ratio between the p-transistor widths and the n-transistor widths). Cells that comprise more than one stage of logic can also be varied by altering the ratio of the drive strengths between stages. There are easily hundreds or thousands of potentially useful electrical variants per logical cell.

The optimum choice of transistor sizes depends on the context of the cell, including the load of the cell and the drive-strength of the previous stage. Larger transistors drive their load faster, but they load and slow the previous stage and use more power. Having a limited set of choices produces a design that has longer cycle times, uses more power, and has more area than a fully optimized design.

Power consumption

The power consumption of a static CMOS block can be approximated by C * V^2 * f, where C is the total capacitance of the block, V is the voltage, and f is the frequency of the design. Assuming the voltage and frequency are fixed for a given design, the power consumption is proportional to the total capacitance. There are additional factors that contribute to power consumption, such as power-to-ground shorting during switching and switching activity. However, if the total transistor width and interconnect capacitance are reduced, these factors will decrease as well.
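As a rough illustration of this proportionality, the Python sketch below evaluates the approximation for two capacitance values at the same voltage and frequency. The function name dynamic_power and the numeric values (1.8 V, 200 MHz, 100 pF) are assumptions chosen only for illustration; they are not taken from any design discussed in this article.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Approximate dynamic power of a static CMOS block as C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency

# Hypothetical operating point, for illustration only.
VOLTAGE = 1.8        # volts
FREQUENCY = 200e6    # hertz

base = dynamic_power(100e-12, VOLTAGE, FREQUENCY)     # 100 pF block
reduced = dynamic_power(75e-12, VOLTAGE, FREQUENCY)   # 25 percent less capacitance

# With V and f fixed, the power ratio equals the capacitance ratio.
ratio = reduced / base
print(f"{ratio:.2f}")  # prints 0.75
```

Under this approximation, a 25 percent cut in total capacitance gives a 25 percent cut in power, whatever the chosen voltage and frequency.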

There are four major benefits of using arbitrary transistor sizes, rather than a fixed library of elements:

1) Increase performance by minimizing the delay through critical paths.

2) Improve timing and design closure by eliminating cases where no suitable library element exists. Such cases leave timing paths grossly over budget and require manual re-work of the design.

3) Reduce power by reducing the total transistor size of the design.

4) Reduce area by reducing the total transistor size of the design, which in turn reduces the interconnect capacitance and so further reduces power.

The effects of electrical variants on timing, power, and area will be demonstrated by an example.

Figure 1 shows an example of a timing path through a standard-cell block. Suppose that the initial inverter drive-strength is fixed at 4x (for example, the output driver of a flip-flop). Also, suppose that the input capacitance of the gates being driven totals 48 units, where 1 unit is the input capacitance of a 1x inverter.

Assuming P-transistors have half the drive strength of N-transistors and that the 4-input NOR gate has equal rise and fall times, its input capacitance is 3 times the capacitance of a similar drive-strength inverter (logical effort of the NOR gate is 3.0). As interconnect capacitance is becoming a significant part of delay and power, assume each wire has a capacitance of 4 units. We will attempt to find the best solution for the drive-strengths (m and n) of the two gates. The delay through the path is therefore expressed as:

T = (3.0 * m + 4) / 4 + (3.0 * n + 4) / m + (48 + 4)/ n
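The factor of 3.0 in this expression can be checked with the usual logical-effort sizing argument, sketched below in Python. The function name nor_logical_effort and its parameters are hypothetical. The sketch assumes that series P-transistors must each be widened by the number of inputs to match the inverter's pull-up strength, which is the standard textbook model rather than anything specific to this article.

```python
def nor_logical_effort(num_inputs, pn_ratio=2.0):
    """Input capacitance of a NOR gate relative to an equal-drive inverter.

    pn_ratio is the P-to-N width ratio that equalizes rise and fall times
    (2.0 when P-transistors have half the drive strength of N-transistors).
    """
    # Reference inverter: N width 1, P width pn_ratio.
    inverter_input_cap = 1.0 + pn_ratio
    # NOR gate: N-transistors are in parallel (width 1 each), while the
    # num_inputs series P-transistors must each be num_inputs * pn_ratio wide.
    nor_input_cap = 1.0 + num_inputs * pn_ratio
    return nor_input_cap / inverter_input_cap

print(nor_logical_effort(4))  # prints 3.0
```

For the 4-input NOR gate, each input sees 9 units of transistor width against 3 for the inverter, which gives the ratio of 3.0 used in the delay expression.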

For example, if the first gate is an 8x NOR gate and the second gate is a 16x NOR gate, the total delay would be:

T = (3.0 * 8 + 4) / 4 + (3.0 * 16 + 4) / 8 + 52 / 16

= 7.0 + 6.5 + 3.25

= 16.75
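The same arithmetic can be written as a small Python function, which makes it easy to try other drive strengths. This is only a sketch of the delay expression above; the names path_delay, WIRE_CAP, LOAD_CAP, DRIVER_STRENGTH, and NOR4_EFFORT are introduced here for illustration.

```python
NOR4_EFFORT = 3.0       # input capacitance of the NOR gate relative to an inverter
WIRE_CAP = 4.0          # capacitance of each wire, in 1x-inverter units
LOAD_CAP = 48.0         # total input capacitance of the driven gates
DRIVER_STRENGTH = 4.0   # fixed 4x driver at the start of the path

def path_delay(m, n):
    """Delay through the path for NOR drive strengths m (first) and n (second)."""
    driver_stage = (NOR4_EFFORT * m + WIRE_CAP) / DRIVER_STRENGTH
    first_nor_stage = (NOR4_EFFORT * n + WIRE_CAP) / m
    second_nor_stage = (LOAD_CAP + WIRE_CAP) / n
    return driver_stage + first_nor_stage + second_nor_stage

print(path_delay(8, 16))  # prints 16.75
```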

Ignoring the fixed capacitance of the previous and following stages, the capacitance of this path (the two wires plus the inputs of the two gates) is as follows:

C = 4.0 + 3.0 * m + 4.0 + 3.0 * n units

The power consumption is proportional to the total capacitance and can be expressed in terms of units, where 1 unit is the power consumption of a 1x inverter.
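As a sketch, the capacitance expression can be evaluated in Python for the 8x and 16x gates used in the delay example. The names path_capacitance, WIRE_CAP, and NOR4_EFFORT are introduced here for illustration.

```python
NOR4_EFFORT = 3.0   # input capacitance of the NOR gate relative to an inverter
WIRE_CAP = 4.0      # capacitance of each wire, in 1x-inverter units

def path_capacitance(m, n):
    """Capacitance of the path, in units of one 1x-inverter input."""
    first_stage = WIRE_CAP + NOR4_EFFORT * m    # wire plus first NOR gate input
    second_stage = WIRE_CAP + NOR4_EFFORT * n   # wire plus second NOR gate input
    return first_stage + second_stage

print(path_capacitance(8, 16))  # prints 80.0
```

Because power is taken to be proportional to capacitance, the same number serves as the power figure for the path.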

As the drive-strength of the gate increases, its delay decreases, but the delay of the previous stage increases. In a typical library, there are several drive strengths for each cell. Assume that the drive strengths available are typical: 1x, 2x, 4x, 8x, 16x, and so forth. The optimum timing solution for such a library is:

m = 8x, n = 16x

This solution generates a delay of 16.75 units, with a power consumption of 80.0 units (see Figure 2).
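The choice of 8x and 16x can be confirmed by brute force. The Python sketch below tries every pair of drive strengths from an assumed library of 1x through 64x (the upper limit is an assumption, since the list above is open-ended) and reports the pair with the smallest delay. The function and constant names are introduced here for illustration.

```python
from itertools import product

# Assumed library drive strengths; the 64x upper limit is an illustrative choice.
DRIVE_STRENGTHS = [1, 2, 4, 8, 16, 32, 64]

def path_delay(m, n):
    """Delay through the path for NOR drive strengths m and n."""
    return (3.0 * m + 4) / 4 + (3.0 * n + 4) / m + (48 + 4) / n

def path_capacitance(m, n):
    """Capacitance (and relative power) of the path."""
    return 4.0 + 3.0 * m + 4.0 + 3.0 * n

# Evaluate every (m, n) pair and keep the one with the smallest delay.
best_m, best_n = min(product(DRIVE_STRENGTHS, repeat=2),
                     key=lambda pair: path_delay(*pair))

print(best_m, best_n, path_delay(best_m, best_n), path_capacitance(best_m, best_n))
# prints: 8 16 16.75 80.0
```

The nearest alternatives, such as 8x and 8x (17.0 units) or 4x and 8x (17.5 units), are all slower than the 8x and 16x pair.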

If the target cycle time is 16.75, it's unlikely that such a path using the fixed library elements would attract any attention; timing is met and the drive-strengths taper up nicely. The second NOR gate uses a fair amount of power, but no other solution meets timing, so that power can't be avoided.

One of the advantages of creating derivative cells is improved timing and design closure. Consider the case where the standard-cell library contains only a 1x 4-input NOR gate. In such a case, m=1x, n=1x (the only solution possible), which gives a delay of 60.75 units, about 3.6 times the required cycle time. Clearly, this is unacceptable and the design will need to be re-worked.
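The 1x-only case can be checked with the same delay expression. In the Python sketch below, the names path_delay and TARGET_CYCLE_TIME are introduced for illustration, and the target of 16.75 units is the cycle time assumed earlier in the example.

```python
TARGET_CYCLE_TIME = 16.75   # target cycle time from the example, in delay units

def path_delay(m, n):
    """Delay through the path for NOR drive strengths m and n."""
    return (3.0 * m + 4) / 4 + (3.0 * n + 4) / m + (48 + 4) / n

delay_1x = path_delay(1, 1)            # only a 1x NOR gate is available
overshoot = delay_1x / TARGET_CYCLE_TIME

print(delay_1x)             # prints 60.75
print(round(overshoot, 2))  # prints 3.63
```

Almost all of the delay (52 of the 60.75 units) comes from the 1x second gate driving the 52-unit load, which is why a single missing drive strength can push a path so far over budget.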

Inserting buffers will sometimes alleviate the problem, but the buffers create additional delay, which can leave timing that still fails. It doesn't help to add to the library higher-drive gates that consist of a smaller drive gate followed by a buffer. In practice, this is the same as buffer insertion, with the added constraint of not being able to vary the drive-strengths independently. The lack of single-stage, higher drive-strength variants for many gates, which are expected to be used infrequently, is a common problem with many standard-cell libraries.

Automatically building the derivative cell eliminates this problem and makes design closure easier. When the library elements can be arbitrarily sized, the drive-strengths that result in the optimum delay are as follows:

m = 7.0x,